Yule–Simon distribution (yulesimon)#

The Yule–Simon distribution is a heavy-tailed discrete distribution on the positive integers. It’s a classic model for rich-get-richer / preferential attachment dynamics: a few categories become very large, while most stay small.

Learning goals#

  • Recognize when Yule–Simon is a reasonable model for count data with a power-law tail.

  • Write the PMF/CDF in terms of Beta/Gamma functions (and understand the tail behavior).

  • Know which moments exist (and when the mean/variance/skewness/kurtosis are infinite).

  • Derive the expectation and variance via a Beta–geometric mixture representation.

  • Implement sampling with NumPy only and validate it with Monte Carlo simulation.

  • Use scipy.stats.yulesimon and scipy.stats.fit for evaluation, simulation, and parameter estimation.

Prerequisites#

  • Discrete probability (PMF/CDF), expectation, variance

  • Comfort with logs and basic calculus

  • Familiarity with the Gamma/Beta functions is helpful (but introduced as needed)

Notebook roadmap#

  1. Title & Classification

  2. Intuition & Motivation

  3. Formal Definition

  4. Moments & Properties

  5. Parameter Interpretation

  6. Derivations (Expectation, Variance, Likelihood)

  7. Sampling & Simulation (NumPy-only)

  8. Visualization (PMF, CDF, Monte Carlo)

  9. SciPy Integration

  10. Statistical Use Cases

  11. Pitfalls

  12. Summary

import os
import sys

import numpy as np
import plotly
import plotly.graph_objects as go
import plotly.io as pio
import scipy
from scipy import optimize, special, stats

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)

SEED = 7
rng = np.random.default_rng(SEED)

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
SciPy: 1.15.0
Plotly: 6.5.2
Seed: 7

1) Title & Classification#

  • Name: yulesimon (Yule–Simon distribution)

  • Type: Discrete

  • Support (SciPy loc=0): (k \in {1,2,3,\dots})

    • With a location shift loc, support becomes (k \in {\mathrm{loc}+1, \mathrm{loc}+2, \dots}).

  • Parameter space: (\alpha > 0) (shape)

Notation:

  • (K \sim \mathrm{YuleSimon}(\alpha)).

  • SciPy uses the shape parameter name alpha: scipy.stats.yulesimon(alpha, loc=0).
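A minimal sketch of this classification with SciPy (the values alpha=2.0 and loc=3 are arbitrary illustrations, not part of the definition):

```python
import numpy as np
from scipy import stats

# Default support starts at 1.
dist = stats.yulesimon(2.0)
assert dist.pmf(0) == 0.0 and dist.pmf(1) > 0.0

# A location shift loc=3 moves the support to {4, 5, ...}:
# the shifted PMF at k equals the unshifted PMF at k - loc.
shifted = stats.yulesimon(2.0, loc=3)
assert shifted.pmf(3) == 0.0
assert np.isclose(shifted.pmf(4), dist.pmf(1))
print(dist.pmf(1))  # alpha * B(1, alpha+1) = 2/3 for alpha = 2
```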

2) Intuition & Motivation#

What this distribution models#

The Yule–Simon distribution models positive integer counts with a power-law (heavy) tail:

  • many observations are small (1, 2, 3, …)

  • a non-negligible fraction are very large

This pattern often comes from cumulative advantage dynamics:

items that are already frequent are more likely to be observed again.

Typical real-world use cases#

  • Word frequencies in text (a few words are very common, most are rare)

  • Citations of scientific papers

  • In-degree in networks with preferential attachment (links to popular pages keep accumulating)

  • Popularity / sales counts (blockbusters vs the long tail)

Relations to other distributions#

  • Power laws / Pareto / Zipf: Yule–Simon is a discrete distribution with a power-law tail.

  • Mixture of geometric distributions: if (P \sim \mathrm{Beta}(\alpha,1)) and (K\mid P \sim \mathrm{Geometric}(P)) on ({1,2,\dots}), then (K) is Yule–Simon.

  • Simon’s model (preferential attachment with innovation) yields Yule–Simon-like frequency distributions.
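The Beta–geometric mixture in the second bullet is easy to check by simulation; here is a quick Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n = 2.5, 200_000

# P ~ Beta(alpha, 1), then K | P ~ Geometric(P) on {1, 2, ...}.
p = rng.beta(alpha, 1.0, size=n)
k = rng.geometric(p)

# Empirical frequencies should match the Yule-Simon PMF.
for kk in range(1, 6):
    print(kk, np.mean(k == kk), stats.yulesimon.pmf(kk, alpha))
```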

3) Formal Definition#

Let (K \sim \mathrm{YuleSimon}(\alpha)) with (\alpha>0) and support (k=1,2,\dots).

PMF#

A standard closed form uses the beta function (B(\cdot,\cdot)):

[ \mathbb{P}(K=k\mid \alpha) = \alpha\,B(k,\alpha+1), \qquad k=1,2,\dots ]

Using (B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}), this is equivalent to

[ \mathbb{P}(K=k\mid \alpha) = \alpha\,\frac{\Gamma(k)\Gamma(\alpha+1)}{\Gamma(k+\alpha+1)}. ]

CDF (and survival function)#

Because this is discrete, for real (x),

[ F(x)=\mathbb{P}(K \le x)=\mathbb{P}(K \le \lfloor x\rfloor). ]

For integer (k\ge 1), a convenient closed form is

[ F(k)=1-\mathbb{P}(K>k) =1-\alpha\,B(k+1,\alpha) =1-k\,B(k,\alpha+1). ]

We will use the survival function form for numerical stability when (F(k)) is very close to 1.
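To see why the survival form matters, compare `sf` with the naive `1 - cdf` deep in the tail (the cutoff k = 10**6 is just an illustration):

```python
from scipy import stats

alpha, k = 2.5, 10**6

# Direct tail probability: ~ Gamma(alpha+1) * k**(-alpha), tiny but accurate.
tail_sf = stats.yulesimon.sf(k, alpha)

# Via 1 - cdf: cdf(k) is within a few ulps of 1, so most significant
# digits of the tail probability are lost in the subtraction.
tail_naive = 1.0 - stats.yulesimon.cdf(k, alpha)

print(tail_sf, tail_naive)
```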

import math


def validate_alpha(alpha: float) -> float:
    alpha = float(alpha)
    if not np.isfinite(alpha) or alpha <= 0:
        raise ValueError(f"alpha must be positive and finite; got {alpha!r}")
    return alpha


_lgamma = np.vectorize(math.lgamma, otypes=[float])


def yulesimon_logpmf(k, alpha: float) -> np.ndarray:
    '''Log-PMF of Yule–Simon(alpha) at integer k >= 1.

    Implemented with Python's `math.lgamma` (no SciPy special functions).
    '''

    alpha = validate_alpha(alpha)
    k = np.asarray(k)
    kf = k.astype(float)

    out = np.full_like(kf, fill_value=-np.inf, dtype=float)
    mask = (kf >= 1) & (kf == np.floor(kf))
    if np.any(mask):
        kv = kf[mask]
        out[mask] = (
            np.log(alpha)
            + _lgamma(alpha + 1.0)
            + _lgamma(kv)
            - _lgamma(kv + alpha + 1.0)
        )
    return out


def yulesimon_pmf(k, alpha: float) -> np.ndarray:
    '''PMF of Yule–Simon(alpha) at k (returns 0 outside the support).'''
    return np.exp(yulesimon_logpmf(k, alpha))


def yulesimon_logsf(k, alpha: float) -> np.ndarray:
    '''Log survival function log P(K > k) for integer k >= 0.

    Uses: P(K > k) = alpha * B(k+1, alpha).
    '''

    alpha = validate_alpha(alpha)
    k = np.asarray(k)
    kf = k.astype(float)

    out = np.full_like(kf, fill_value=-np.inf, dtype=float)

    # For any k < 1, P(K > k) = 1 because the support starts at 1.
    out[kf < 1] = 0.0

    mask = (kf >= 1) & (kf == np.floor(kf))
    if np.any(mask):
        kv = kf[mask]
        out[mask] = (
            np.log(alpha)
            + _lgamma(kv + 1.0)
            + _lgamma(alpha)
            - _lgamma(kv + alpha + 1.0)
        )

    return out


def yulesimon_sf(k, alpha: float) -> np.ndarray:
    return np.exp(yulesimon_logsf(k, alpha))


def yulesimon_cdf(x, alpha: float) -> np.ndarray:
    '''CDF evaluated at real x via floor(x).'''

    alpha = validate_alpha(alpha)
    x = np.asarray(x)
    k = np.floor(x)

    out = np.zeros_like(k, dtype=float)
    mask = k >= 1
    if np.any(mask):
        out[mask] = 1.0 - yulesimon_sf(k[mask], alpha)

    return out
# Quick sanity checks vs SciPy

alpha = 2.5
k = np.arange(1, 11)

pmf_np = yulesimon_pmf(k, alpha)
pmf_sp = stats.yulesimon.pmf(k, alpha)
print("max |pmf numpy - scipy|:", float(np.max(np.abs(pmf_np - pmf_sp))))

x = np.array([-3.0, 0.0, 0.9, 1.0, 2.0, 5.0, 10.0])
cdf_np = yulesimon_cdf(x, alpha)
cdf_sp = stats.yulesimon.cdf(x, alpha)
print("max |cdf numpy - scipy|:", float(np.max(np.abs(cdf_np - cdf_sp))))

# PMF should sum to 1; we can check this by truncating and adding the exact tail mass.
K = 2000
mass_trunc = yulesimon_pmf(np.arange(1, K + 1), alpha).sum()
tail_mass = yulesimon_sf(K, alpha)
print("mass_trunc + tail_mass:", float(mass_trunc + tail_mass))
max |pmf numpy - scipy|: 7.771561172376096e-16
max |cdf numpy - scipy|: 1.1102230246251565e-16
mass_trunc + tail_mass: 0.9999999999999994

4) Moments & Properties#

Existence of moments#

The Yule–Simon distribution is heavy-tailed. In fact, the (m)-th raw moment exists iff

[ \mathbb{E}[K^m] < \infty \quad \Longleftrightarrow \quad \alpha > m. ]

So:

  • mean exists only for (\alpha>1)

  • variance exists only for (\alpha>2)

  • skewness exists only for (\alpha>3)

  • (excess) kurtosis exists only for (\alpha>4)
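These thresholds are easy to see numerically: below the threshold, the truncated moment sums never settle. A sketch for the mean with alpha = 0.9 (the truncation points are arbitrary):

```python
import numpy as np
from scipy import stats

alpha = 0.9  # mean is infinite since alpha <= 1

# Partial sums of k * p(k) behave like K**(1 - alpha): they grow without bound.
partials = []
for K in [10**2, 10**4, 10**6]:
    k = np.arange(1, K + 1)
    partials.append(float(np.sum(k * stats.yulesimon.pmf(k, alpha))))
print(partials)  # strictly increasing, no finite limit
```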

Mean, variance, skewness, kurtosis#

For (\alpha>1),

[ \mathbb{E}[K] = \frac{\alpha}{\alpha-1}. ]

For (\alpha>2),

[ \mathrm{Var}(K)=\frac{\alpha^2}{(\alpha-1)^2(\alpha-2)}. ]

For (\alpha>3), the skewness is

[ \gamma_1 = \frac{(\alpha+1)^2\sqrt{\alpha-2}}{\alpha(\alpha-3)}. ]

For (\alpha>4), the excess kurtosis is

[ \gamma_2 = \frac{\alpha^4 + 7\alpha^3 - 9\alpha^2 - 13\alpha - 22}{\alpha(\alpha-3)(\alpha-4)}. ]

(The kurtosis is (\gamma_2 + 3).)

PGF / MGF / characteristic function#

A useful analytic object for discrete distributions is the probability generating function (PGF)

[ G(z)=\mathbb{E}[z^K], \qquad |z|<1. ]

For Yule–Simon,

[ G(z)=\frac{\alpha z}{\alpha+1}\,{}_2F_1(1,1;\alpha+2;z), ]

where ({}_2F_1) is the Gauss hypergeometric function.

  • The MGF is (M(t)=\mathbb{E}[e^{tK}]=G(e^t)), which is finite for (t<0) (since (e^t<1)). It diverges for any (t>0) because of the power-law tail.

  • The characteristic function is (\varphi(t)=\mathbb{E}[e^{itK}]=G(e^{it})). It exists for all real (t), though numerical evaluation at (t=0) should be handled carefully.
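A quick consistency check of the PGF formula against the truncated series (\sum_k p(k) z^k) (the point z = 0.5 is arbitrary):

```python
import numpy as np
from scipy import special, stats

alpha, z = 2.5, 0.5

# Closed form: G(z) = alpha*z/(alpha+1) * 2F1(1, 1; alpha+2; z)
G_closed = alpha * z / (alpha + 1.0) * special.hyp2f1(1.0, 1.0, alpha + 2.0, z)

# Truncated power series: the tail beyond k = 200 is negligible
# because z**k decays geometrically for |z| < 1.
k = np.arange(1, 201)
G_series = float(np.sum(stats.yulesimon.pmf(k, alpha) * z**k))

print(G_closed, G_series)
```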

Entropy#

The Shannon entropy is

[ H(K)=-\sum_{k=1}^\infty p(k)\log p(k). ]

It is finite for all (\alpha>0), but (to my knowledge) there is no simple closed form; we can evaluate it numerically.
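A direct numerical evaluation by truncated summation (the cutoff 10**5 is an arbitrary but safe choice for alpha = 2.5, where the terms decay like k**-3.5 * log k):

```python
import numpy as np
from scipy import stats

alpha = 2.5
k = np.arange(1, 100_001)
p = stats.yulesimon.pmf(k, alpha)
H = float(-np.sum(p * np.log(p)))  # truncated entropy sum

print(H, stats.yulesimon.entropy(alpha))  # the two agree closely
```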

def yulesimon_theoretical_stats(alpha: float) -> dict:
    alpha = validate_alpha(alpha)

    mean = np.inf if alpha <= 1 else alpha / (alpha - 1)
    var = np.inf if alpha <= 2 else alpha**2 / ((alpha - 1) ** 2 * (alpha - 2))

    skew = (
        np.inf
        if alpha <= 3
        else ((alpha + 1) ** 2 * np.sqrt(alpha - 2)) / (alpha * (alpha - 3))
    )

    excess_kurt = (
        np.inf
        if alpha <= 4
        else (alpha**4 + 7 * alpha**3 - 9 * alpha**2 - 13 * alpha - 22)
        / (alpha * (alpha - 3) * (alpha - 4))
    )
    kurt = np.inf if alpha <= 4 else excess_kurt + 3

    return {
        "alpha": alpha,
        "mean": mean,
        "var": var,
        "skew": skew,
        "kurt": kurt,
        "excess_kurt": excess_kurt,
    }


def yulesimon_pgf(z, alpha: float):
    '''Probability generating function G(z)=E[z^K] for |z|<1.'''

    alpha = validate_alpha(alpha)
    z = np.asarray(z, dtype=complex)
    return (alpha * z / (alpha + 1.0)) * special.hyp2f1(1.0, 1.0, alpha + 2.0, z)


def yulesimon_mgf(t, alpha: float):
    '''MGF M(t)=E[e^{tK}] (finite for t<0).'''

    t = np.asarray(t, dtype=float)
    if np.any(t >= 0):
        raise ValueError("MGF diverges for t >= 0 (heavy tail); use t < 0")
    return yulesimon_pgf(np.exp(t), alpha)


def yulesimon_cf(t, alpha: float):
    '''Characteristic function phi(t)=E[e^{itK}].'''

    t = np.asarray(t, dtype=float)
    out = yulesimon_pgf(np.exp(1j * t), alpha)
    out = np.where(t == 0, 1.0 + 0j, out)
    return out


for a in [1.5, 2.5, 5.0]:
    mvsk = stats.yulesimon.stats(a, moments="mvsk")
    ent = stats.yulesimon.entropy(a)

    print()
    print(f"alpha={a}")
    print("theory (closed form when finite):", yulesimon_theoretical_stats(a))
    print("SciPy stats (mean, var, skew, kurt):", tuple(float(x) for x in mvsk))
    print("SciPy entropy:", float(ent))

# MGF spot-check for a negative t
alpha = 3.5
t = -0.3
print()
print("MGF(t) at t=-0.3 (complex rounding):", complex(yulesimon_mgf(t, alpha)))
print("CF(t) at t=0.7:", complex(yulesimon_cf(0.7, alpha)))
alpha=1.5
theory (closed form when finite): {'alpha': 1.5, 'mean': 3.0, 'var': inf, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}
SciPy stats (mean, var, skew, kurt): (3.0, inf, nan, nan)
SciPy entropy: 1.503896885539161

alpha=2.5
theory (closed form when finite): {'alpha': 2.5, 'mean': 1.6666666666666667, 'var': 5.555555555555555, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}
SciPy stats (mean, var, skew, kurt): (1.6666666666666667, 5.555555555555555, inf, inf)
SciPy entropy: 1.0232863517009578

alpha=5.0
theory (closed form when finite): {'alpha': 5.0, 'mean': 1.25, 'var': 0.5208333333333334, 'skew': 6.235382907247958, 'kurt': 121.8, 'excess_kurt': 118.8}
SciPy stats (mean, var, skew, kurt): (1.25, 0.5208333333333334, 6.235382907247958, 118.8)
SciPy entropy: 0.6061305592752749

MGF(t) at t=-0.3 (complex rounding): (0.6798440777897914+0j)
CF(t) at t=0.7: (0.5734322687509957+0.6747782631823974j)
/home/tempa/miniconda3/lib/python3.12/site-packages/scipy/stats/_discrete_distns.py:1834: RuntimeWarning:

invalid value encountered in sqrt

/home/tempa/miniconda3/lib/python3.12/site-packages/numpy/lib/function_base.py:2410: RuntimeWarning:

expect(): sum did not converge

/home/tempa/miniconda3/lib/python3.12/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning:

expect(): sum did not converge

5) Parameter Interpretation#

The single shape parameter (\alpha) controls tail heaviness.

  • Smaller (\alpha) (\Rightarrow) heavier tail (large counts become more likely).

  • Larger (\alpha) (\Rightarrow) faster decay and more mass near 1.

Two useful facts:

  1. Tail exponent: as (k\to\infty),

[ \mathbb{P}(K=k) \sim \alpha\,\Gamma(\alpha+1)\,k^{-(\alpha+1)}. ]

So (\alpha) directly controls the power-law exponent.

  2. Moment thresholds: the mean/variance/etc exist only when (\alpha) is above the corresponding order. This matters in practice: for (\alpha \le 2), the empirical variance can be dominated by rare extreme samples.
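The tail approximation can be checked directly: the ratio pmf(k) / (alpha * Gamma(alpha+1) * k**-(alpha+1)) should approach 1. The correction is O(1/k), so it is still visibly off at small k:

```python
import numpy as np
from scipy import special, stats

alpha = 2.5
const = alpha * special.gamma(alpha + 1.0)  # the asymptotic prefactor

ratios = []
for k in [10, 100, 1_000, 10_000]:
    ratios.append(
        float(stats.yulesimon.pmf(k, alpha) / (const * float(k) ** -(alpha + 1.0)))
    )
print(ratios)  # tends to 1 as k grows
```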

# Shape changes as alpha varies

alphas = [0.7, 1.2, 2.5, 5.0, 10.0]
k = np.arange(1, 60)

fig = go.Figure()
for a in alphas:
    fig.add_trace(
        go.Scatter(
            x=k,
            y=stats.yulesimon.pmf(k, a),
            mode="lines+markers",
            name=f"alpha={a}",
        )
    )

fig.update_layout(
    title="Yule–Simon PMF for different α (log–log view)",
    xaxis_title="k",
    yaxis_title="P(K=k)",
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()

6) Derivations#

A very convenient representation is a Beta–geometric mixture.

6.1 PMF via a mixture#

Let

  • (P \sim \mathrm{Beta}(\alpha, 1)) with density (f(p)=\alpha p^{\alpha-1}) on (0<p<1)

  • (K\mid P=p \sim \mathrm{Geometric}(p)) on ({1,2,\dots}), i.e. (\mathbb{P}(K=k\mid p)=p(1-p)^{k-1})

Then the marginal PMF is

[ \mathbb{P}(K=k) = \int_0^1 p(1-p)^{k-1}\,\alpha p^{\alpha-1}\,dp = \alpha\int_0^1 p^{\alpha}(1-p)^{k-1}\,dp = \alpha\,B(k,\alpha+1). ]

6.2 Expectation#

Using the law of total expectation and the fact that (\mathbb{E}[K\mid P=p]=1/p),

[ \mathbb{E}[K] = \mathbb{E}[\mathbb{E}[K\mid P]] = \mathbb{E}\left[\frac{1}{P}\right] = \alpha\int_0^1 p^{\alpha-2}\,dp = \frac{\alpha}{\alpha-1}, ]

which is finite only for (\alpha>1).

6.3 Variance#

Using the law of total variance,

[ \mathrm{Var}(K)=\mathbb{E}[\mathrm{Var}(K\mid P)] + \mathrm{Var}(\mathbb{E}[K\mid P]). ]

For the geometric distribution on ({1,2,\dots}), (\mathrm{Var}(K\mid p)=(1-p)/p^2) and (\mathbb{E}[K\mid p]=1/p). So

[ \mathrm{Var}(K)=\mathbb{E}\left[\frac{1-P}{P^2}\right] + \mathrm{Var}\left(\frac{1}{P}\right) = \frac{\alpha^2}{(\alpha-1)^2(\alpha-2)}, ]

finite only for (\alpha>2).
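The two pieces of the decomposition can be checked by numerical integration against the Beta((\alpha), 1) density (alpha = 2.5 is an arbitrary choice above the variance threshold):

```python
from scipy import integrate

alpha = 2.5  # variance exists only for alpha > 2


def beta_pdf(p):
    # Beta(alpha, 1) density on (0, 1)
    return alpha * p ** (alpha - 1.0)


# E[Var(K|P)] = E[(1-P)/P^2]
ev, _ = integrate.quad(lambda p: (1.0 - p) / p**2 * beta_pdf(p), 0.0, 1.0)

# Var(E[K|P]) = Var(1/P) = E[1/P^2] - E[1/P]^2
e1, _ = integrate.quad(lambda p: beta_pdf(p) / p, 0.0, 1.0)
e2, _ = integrate.quad(lambda p: beta_pdf(p) / p**2, 0.0, 1.0)
ve = e2 - e1**2

closed = alpha**2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))
print(ev + ve, closed)  # both equal alpha^2 / ((alpha-1)^2 (alpha-2))
```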

6.4 Likelihood#

Given i.i.d. data (k_1,\dots,k_n) (each (k_i\in{1,2,\dots})), the log-likelihood is

[ \ell(\alpha) = \sum_{i=1}^n \log \mathbb{P}(K=k_i\mid \alpha) = n\log\alpha + n\log\Gamma(\alpha+1) + \sum_{i=1}^n \log\Gamma(k_i) - \sum_{i=1}^n \log\Gamma(k_i+\alpha+1). ]

Differentiating gives the score equation (using the digamma function (\psi=\Gamma'/\Gamma)):

[ \frac{d\ell}{d\alpha} = \frac{n}{\alpha} + n\psi(\alpha+1) - \sum_{i=1}^n \psi(k_i+\alpha+1) = 0, ]

which can be solved numerically for the MLE.

def yulesimon_loglik(alpha: float, data: np.ndarray) -> float:
    '''Log-likelihood for i.i.d. sample from Yule–Simon(alpha).'''

    alpha = validate_alpha(alpha)
    data = np.asarray(data)

    if np.any(data < 1) or np.any(np.floor(data) != data):
        raise ValueError("All observations must be integers >= 1")

    n = data.size
    return (
        n * np.log(alpha)
        + n * special.gammaln(alpha + 1.0)
        + np.sum(special.gammaln(data))
        - np.sum(special.gammaln(data + alpha + 1.0))
    )


def yulesimon_score(alpha: float, data: np.ndarray) -> float:
    '''Derivative of log-likelihood w.r.t. alpha.'''

    alpha = validate_alpha(alpha)
    data = np.asarray(data)
    n = data.size

    return float(
        n / alpha
        + n * special.digamma(alpha + 1.0)
        - np.sum(special.digamma(data + alpha + 1.0))
    )


# Fit alpha by MLE on simulated data
alpha_true = 2.5
n = 5000
data = stats.yulesimon.rvs(alpha_true, size=n, random_state=rng)

res = optimize.minimize_scalar(
    lambda a: -yulesimon_loglik(a, data), bounds=(1e-6, 50.0), method="bounded"
)
alpha_hat = float(res.x)

fit_res = stats.fit(
    stats.yulesimon,
    data,
    bounds={"alpha": (1e-6, 50.0), "loc": (0, 0)},
    guess={"alpha": alpha_hat, "loc": 0},
)

print("alpha_true:", alpha_true)
print("alpha_hat (MLE via minimize_scalar):", alpha_hat)
print("score(alpha_hat):", yulesimon_score(alpha_hat, data))
print("SciPy stats.fit params:", fit_res.params)
alpha_true: 2.5
alpha_hat (MLE via minimize_scalar): 2.466878956783674
score(alpha_hat): -0.0001069346399162896
SciPy stats.fit params: FitParams(alpha=2.4668784592208555, loc=0.0)

7) Sampling & Simulation (NumPy-only)#

Using the Beta–geometric mixture gives a simple sampler.

  1. Draw (U \sim \mathrm{Uniform}(0,1)) and set (P = U^{1/\alpha}). This works because if (P \sim \mathrm{Beta}(\alpha,1)), then (P = U^{1/\alpha}) in distribution.

  2. Given (P=p), draw (K\mid P=p \sim \mathrm{Geometric}(p)) on ({1,2,\dots}).

For the geometric step we can use inverse transform sampling:

[ K = 1 + \left\lfloor \frac{\log V}{\log(1-p)} \right\rfloor,\qquad V\sim \mathrm{Uniform}(0,1), ]

(where (\log(1-p)<0)).

def yulesimon_rvs_numpy(alpha: float, size: int, rng: np.random.Generator | None = None):
    '''Sample from Yule–Simon(alpha) using only NumPy + standard library math.

    Returns samples on {1, 2, ...}.
    '''

    alpha = validate_alpha(alpha)
    if rng is None:
        rng = np.random.default_rng()

    size = int(size)

    # Step 1: P ~ Beta(alpha, 1) via inverse transform: P = U^(1/alpha)
    u = rng.random(size)
    u = np.clip(u, np.finfo(float).tiny, 1.0)
    p = u ** (1.0 / alpha)

    # Step 2: K | P=p ~ Geometric(p) on {1,2,...}
    v = rng.random(size)
    v = np.clip(v, np.finfo(float).tiny, 1.0)

    # Use log1p(-p) for stability when p is small.
    k = 1 + np.floor(np.log(v) / np.log1p(-p))
    return k.astype(int)


alpha = 2.5
samples = yulesimon_rvs_numpy(alpha, size=200_000, rng=rng)

print("min/max:", int(samples.min()), int(samples.max()))
print("sample mean:", float(samples.mean()))
print("sample var:", float(samples.var()))
print("theoretical:", yulesimon_theoretical_stats(alpha))
min/max: 1 195
sample mean: 1.661185
sample var: 4.414319395774998
theoretical: {'alpha': 2.5, 'mean': 1.6666666666666667, 'var': 5.555555555555555, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}

8) Visualization#

We’ll visualize:

  • the PMF (including a log–log view for tail behavior)

  • the CDF / survival function

  • Monte Carlo samples compared to the theoretical distribution

alpha = 2.5

# PMF/CDF for small k
k_small = np.arange(1, 31)
pmf = stats.yulesimon.pmf(k_small, alpha)
cdf = stats.yulesimon.cdf(k_small, alpha)

fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=k_small, y=pmf, name="PMF"))
fig_pmf.update_layout(
    title=f"Yule–Simon PMF (alpha={alpha})",
    xaxis_title="k",
    yaxis_title="P(K=k)",
)
fig_pmf.show()

fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=k_small, y=cdf, mode="lines+markers", name="CDF"))
fig_cdf.update_layout(
    title=f"Yule–Simon CDF (alpha={alpha})",
    xaxis_title="k",
    yaxis_title="P(K ≤ k)",
)
fig_cdf.show()

# Monte Carlo vs theory (empirical CDF and survival)
mc = stats.yulesimon.rvs(alpha, size=80_000, random_state=rng)
mc_sorted = np.sort(mc)

k_tail = np.arange(1, 200)
emp_cdf = np.searchsorted(mc_sorted, k_tail, side="right") / mc_sorted.size
emp_sf = 1.0 - emp_cdf

theo_sf = stats.yulesimon.sf(k_tail, alpha)

fig_tail = go.Figure()
fig_tail.add_trace(go.Scatter(x=k_tail, y=theo_sf, mode="lines", name="theory sf"))
fig_tail.add_trace(
    go.Scatter(x=k_tail, y=emp_sf, mode="markers", name="empirical sf", opacity=0.6)
)
fig_tail.update_layout(
    title=f"Survival function on log–log scale (alpha={alpha})",
    xaxis_title="k",
    yaxis_title="P(K > k)",
)
fig_tail.update_xaxes(type="log")
fig_tail.update_yaxes(type="log")
fig_tail.show()

9) SciPy Integration#

SciPy implements this distribution as scipy.stats.yulesimon with shape parameter alpha and optional loc.

Common methods:

  • stats.yulesimon.pmf(k, alpha, loc=0)

  • stats.yulesimon.cdf(k, alpha, loc=0)

  • stats.yulesimon.rvs(alpha, loc=0, size=..., random_state=...)

For fitting, use scipy.stats.fit (note: the distribution object itself does not have a .fit method):

  • stats.fit(stats.yulesimon, data, bounds=..., method='mle')

alpha = 2.5
k = np.arange(1, 6)

print("pmf:", stats.yulesimon.pmf(k, alpha))
print("cdf:", stats.yulesimon.cdf(k, alpha))
print("rvs:", stats.yulesimon.rvs(alpha, size=10, random_state=rng))

# Fit alpha (and optionally loc). Here we fix loc=0.
data = stats.yulesimon.rvs(alpha, size=2000, random_state=rng)
fit_res = stats.fit(
    stats.yulesimon, data, bounds={"alpha": (1e-6, 50.0), "loc": (0, 0)}
)
print("fit params:", fit_res.params)
print("negative log-likelihood at fit:", float(fit_res.nllf()))
pmf: [0.714286 0.15873  0.05772  0.02664  0.014208]
cdf: [0.714286 0.873016 0.930736 0.957376 0.971584]
rvs: [1 2 1 1 1 4 1 7 1 2]
fit params: FitParams(alpha=2.5872600619798414, loc=0.0)
negative log-likelihood at fit: 1991.4834873154202

10) Statistical Use Cases#

10.1 Hypothesis testing / goodness-of-fit#

  • Goodness-of-fit: fit (\alpha) by MLE and compare observed frequencies to expected frequencies (e.g., a chi-square test with sensible binning, or a parametric bootstrap).

  • Model comparison: compare log-likelihoods / information criteria between heavy-tailed candidates (Yule–Simon, Zipf, negative binomial, etc.).
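A sketch of such a comparison via AIC, fitting the single shape parameter of each candidate by MLE on simulated data (the seed, sample size, and the Zipf competitor are illustrative choices):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = stats.yulesimon.rvs(2.5, size=2_000, random_state=rng)


def fit_nll(dist, bounds):
    """Minimize the negative log-likelihood over the single shape parameter."""
    res = optimize.minimize_scalar(
        lambda a: -np.sum(dist.logpmf(data, a)), bounds=bounds, method="bounded"
    )
    return float(res.fun)


# AIC = 2 * (number of parameters) + 2 * nll; lower is preferred.
aic_ys = 2 * 1 + 2 * fit_nll(stats.yulesimon, (1e-3, 50.0))
aic_zipf = 2 * 1 + 2 * fit_nll(stats.zipf, (1.001, 50.0))
print(aic_ys, aic_zipf)
```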

10.2 Bayesian modeling#

In Bayesian settings, (\alpha) can be given a prior (e.g., Gamma prior on (\alpha)) and inferred from data using:

  • grid approximation (1D problems)

  • MCMC / HMC (more complex models)

  • variational inference (large-scale)

The likelihood is easy to evaluate in log space, which is helpful for posterior computation.

10.3 Generative modeling#

Yule–Simon shows up as the stationary/limiting distribution of preferential attachment mechanisms. A classic example is Simon’s model: with probability (p_{\text{new}}) create a new category; otherwise, copy an existing category proportional to its current count. The induced category-size distribution has a Yule–Simon-like heavy tail.

# 10.1 Goodness-of-fit (illustration): chi-square with a tail bin

alpha_true = 2.5
n = 5000
x = stats.yulesimon.rvs(alpha_true, size=n, random_state=rng)

# Fit alpha by MLE (loc fixed at 0)
res = optimize.minimize_scalar(
    lambda a: -yulesimon_loglik(a, x), bounds=(1e-6, 50.0), method="bounded"
)
alpha_hat = float(res.x)

k_max = 12  # bins: 1..k_max-1, plus tail (>=k_max)
obs = np.array([(x == k).sum() for k in range(1, k_max)] + [(x >= k_max).sum()])

probs = np.append(
    stats.yulesimon.pmf(np.arange(1, k_max), alpha_hat),
    stats.yulesimon.sf(k_max - 1, alpha_hat),
)
exp = n * probs

chi2, p_value = stats.chisquare(f_obs=obs, f_exp=exp)

print("alpha_true:", alpha_true)
print("alpha_hat:", alpha_hat)
print("obs counts:", obs)
print("exp counts:", np.round(exp, 2))
print("chi2:", float(chi2))
print("p-value (naive df; alpha was estimated):", float(p_value))
alpha_true: 2.5
alpha_hat: 2.5961292376240657
obs counts: [3587  827  266  128   63   33   35   17   12    5    4   23]
exp counts: [3609.62  785.36  280.68  127.66   67.22   39.1    24.45   16.15   11.14
    7.96    5.86   24.81]
chi2: 10.821869911711197
p-value (naive df; alpha was estimated): 0.4583050042312996
# 10.2 Bayesian modeling (illustration): 1D grid posterior for alpha

x = stats.yulesimon.rvs(2.5, size=1500, random_state=rng)

alpha_grid = np.linspace(0.2, 10.0, 500)

# Prior: Gamma(a=2, scale=2) (mean=4)
log_prior = stats.gamma.logpdf(alpha_grid, a=2.0, scale=2.0)

log_like = np.array([yulesimon_loglik(a, x) for a in alpha_grid])

log_post = log_prior + log_like
log_post -= np.max(log_post)
post_unnorm = np.exp(log_post)

# Normalize (continuous approximation)
post = post_unnorm / np.trapz(post_unnorm, alpha_grid)

alpha_map = float(alpha_grid[np.argmax(post)])
alpha_mean = float(np.trapz(alpha_grid * post, alpha_grid))

print("posterior MAP:", alpha_map)
print("posterior mean:", alpha_mean)

fig = go.Figure()
fig.add_trace(go.Scatter(x=alpha_grid, y=post, mode="lines", name="posterior"))
fig.add_vline(x=alpha_map, line_dash="dash", line_color="black", annotation_text="MAP")
fig.update_layout(
    title="Posterior over α (grid approximation)",
    xaxis_title="alpha",
    yaxis_title="density",
)
fig.show()
posterior MAP: 2.61563126252505
posterior mean: 2.6291776944815863
# 10.3 Generative modeling: Simon's preferential-attachment process


def simon_process_counts(T: int, p_new: float, rng: np.random.Generator) -> np.ndarray:
    '''Simon's model.

    At each step t:
    - with probability p_new: create a new category
    - otherwise: pick a previous token uniformly and copy its category

    Copying a previous token chooses categories proportional to their current counts.
    Returns the final category counts.
    '''

    if not (0.0 < p_new < 1.0):
        raise ValueError("p_new must be in (0,1)")

    owners = np.empty(T, dtype=int)
    counts = np.zeros(T, dtype=int)

    owners[0] = 0
    counts[0] = 1
    n_types = 1

    for t in range(1, T):
        if rng.random() < p_new:
            type_id = n_types
            n_types += 1
            counts[type_id] = 1
        else:
            type_id = owners[rng.integers(t)]
            counts[type_id] += 1

        owners[t] = type_id

    return counts[:n_types]


T = 50_000
p_new = 0.2
alpha_theory = 1.0 / (1.0 - p_new)

counts = simon_process_counts(T=T, p_new=p_new, rng=rng)

k = np.arange(1, 60)
emp_pmf = np.array([(counts == kk).mean() for kk in k])
theo_pmf = stats.yulesimon.pmf(k, alpha_theory)

fig = go.Figure()
fig.add_trace(
    go.Scatter(x=k, y=emp_pmf, mode="markers", name="Simon empirical", opacity=0.7)
)
fig.add_trace(go.Scatter(x=k, y=theo_pmf, mode="lines", name=f"Yule–Simon α={alpha_theory:.3f}"))
fig.update_layout(
    title="Category-size distribution from Simon's model vs Yule–Simon",
    xaxis_title="k (category size)",
    yaxis_title="fraction of categories of size k",
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()

print("T:", T)
print("number of categories:", counts.size)
print("largest category size:", int(counts.max()))
print("alpha_theory (from p_new):", alpha_theory)
T: 50000
number of categories: 9973
largest category size: 8061
alpha_theory (from p_new): 1.25

11) Pitfalls#

  • Invalid parameters: (\alpha) must be positive. In SciPy, passing non-positive values will produce errors or nan.

  • Infinite moments: remember the thresholds:

    • (\alpha\le 1): mean is infinite

    • (\alpha\le 2): variance is infinite

    • (\alpha\le 3): skewness is infinite

    • (\alpha\le 4): kurtosis is infinite

    In finite samples you will still get finite empirical moments, but they can be extremely unstable.

  • Numerical issues:

    • For large (k), the PMF is tiny; compute in log space (logpmf) when doing likelihood work.

    • For large (k), the CDF can be extremely close to 1; using sf/logsf is usually more stable.

    • SciPy’s entropy and higher moments may warn about slow convergence for heavy tails.

  • Sampling extremes: because of the heavy tail, maximum values in a sample can be surprisingly large even when the mean is finite.
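A small illustration of the last point: repeated samples of the same size produce very different maxima even at alpha = 2.5, where mean and variance are finite (replication count and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 2.5

# Sample maximum of n = 10_000 draws, across 20 independent replications.
maxima = [
    int(stats.yulesimon.rvs(alpha, size=10_000, random_state=rng).max())
    for _ in range(20)
]
print(min(maxima), max(maxima))  # typically spread over a wide range
```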

12) Summary#

  • yulesimon is a discrete, heavy-tailed distribution on ({1,2,\dots}) with shape parameter (\alpha>0).

  • The PMF is (p(k)=\alpha B(k,\alpha+1)) and the CDF has a simple survival-function form (\mathbb{P}(K>k)=\alpha B(k+1,\alpha)).

  • The (m)-th moment exists iff (\alpha>m); in particular, mean requires (\alpha>1) and variance requires (\alpha>2).

  • A clean sampler comes from the Beta–geometric mixture: (P\sim\mathrm{Beta}(\alpha,1)), then (K\mid P\sim\mathrm{Geometric}(P)).

  • SciPy’s stats.yulesimon provides PMF/CDF/SF/RVS, and stats.fit can estimate parameters by MLE.